128 research outputs found
JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction
We present a new parallel corpus, JHU FLuency-Extended GUG corpus (JFLEG) for
developing and evaluating grammatical error correction (GEC). Unlike other
corpora, it represents a broad range of language proficiency levels and uses
holistic fluency edits to not only correct grammatical errors but also make the
original text more native sounding. We describe the types of corrections made
and benchmark four leading GEC systems on this corpus, identifying specific
areas in which they do well and how they can improve. JFLEG fulfills the need
for a new gold standard to properly assess the current state of GEC.
Comment: To appear in EACL 2017 (short papers)
Robust Text Correction for Grammar and Fluency
Grammar is one of the most important properties of natural language. It is a set of structural (i.e., syntactic and morphological) rules that are shared among native speakers in order to communicate smoothly. Automated grammatical error correction (GEC) is a natural language processing (NLP) application that aims to correct grammatical errors in a given source sentence using computational models.
Since data-driven statistical methods emerged in the 1990s and early 2000s, the GEC community has worked on establishing a common framework for its evaluation (i.e., dataset and metric for benchmarking) in order to compare GEC models’ performance quantitatively. A series of shared tasks since the early 2010s is a good example of this.
In the first half of this thesis, I propose character-level and token-level error correction algorithms. For the character-level error correction, I introduce a semi-character recurrent neural network, which is motivated by a finding in psycholinguistics, called the Cmabrigde Uinervtisy (Cambridge University) effect or typoglycemia. For word-level error correction, I propose an error-repair dependency parsing algorithm for ungrammatical texts. The algorithm can parse sentences and correct grammatical errors simultaneously.
However, it is important to note that grammatical errors are not usually limited to morphological or syntactic errors. For example, collocational errors such as *quick/fast food and *fast/quick meal are not fully explained by only syntactic rules. This is another important property of natural language, called fluency (or acceptability). Fluency is a level of mastery that goes beyond knowledge of how to follow the rules, and includes knowing when they can be broken or flouted. In fact, the GEC community has also extended the scope of error types from closed class errors (e.g., noun numbers, verb forms) to the fluency-oriented errors.
The second half of this thesis investigates GEC while considering fluency as well as grammaticality. When it comes to “whole-sentence” correction, by extending the scope of errors considering fluency as well as grammaticality, the GEC community has overlooked the reliability and validity of the task scheme (i.e., evaluation metric and dataset for benchmarking). Thus, I reassess the goals of GEC as a “whole-sentence” rewriting task while considering fluency. Following the fluency-oriented GEC framework, I introduce a new benchmark corpus that is more diverse in various aspects such as proficiency, topics, and learners’ native languages.
Based on the fluency-oriented metric and dataset, I propose a new “whole-sentence” error correction model with neural reinforcement learning. Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes toward an objective that considers a sentence-level, task-specific evaluation metric. I demonstrate that the proposed model outperforms MLE in human and automated evaluation metrics.
Finally, I conclude the thesis and outline ideas and suggestions for future GEC research.
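The semi-character idea mentioned in the thesis abstract above can be illustrated with a small sketch. It assumes, as the abstract does not spell out, that a word is represented by its first character, an order-free bag of its internal characters, and its last character; the function name `semi_character_features` is hypothetical and this is not the thesis's implementation.

```python
from collections import Counter


def semi_character_features(word):
    """Return (first char, bag of internal chars, last char) for a word.

    Assumption: internal character order is discarded, so jumbled words
    that keep their first and last letters map to the same features.
    """
    first = word[0] if word else ""
    last = word[-1] if len(word) > 1 else ""
    # Count internal characters; sorting makes the bag comparable and hashable.
    internal = tuple(sorted(Counter(word[1:-1]).items()))
    return first, internal, last


# Jumbled and correct spellings share the same order-free representation.
same_first_word = (
    semi_character_features("Cmabrigde") == semi_character_features("Cambridge")
)
same_second_word = (
    semi_character_features("Uinervtisy") == semi_character_features("University")
)
```

Under this assumption, a model reading such features cannot distinguish the jumbled form from the original, which is why it can be robust to internal-letter scrambling.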
Test-time Augmentation for Factual Probing
Factual probing is a method that uses prompts to test if a language model
"knows" certain world knowledge facts. A problem in factual probing is that
small changes to the prompt can lead to large changes in model output. Previous
work aimed to alleviate this problem by optimizing prompts via text mining or
fine-tuning. However, such approaches are relation-specific and do not
generalize to unseen relation types. Here, we propose to use test-time
augmentation (TTA) as a relation-agnostic method for reducing sensitivity to
prompt variations by automatically augmenting and ensembling prompts at test
time. Experiments show improved model calibration, i.e., with TTA, model
confidence better reflects prediction accuracy. Improvements in prediction
accuracy are observed for some models, but for other models, TTA leads to
degradation. Error analysis identifies the difficulty of producing high-quality
prompt variations as the main challenge for TTA.
Comment: 12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper)
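A minimal sketch of the ensembling step described in the abstract above, under stated assumptions: each prompt variation has already produced a probability distribution over candidate answers, and the distributions are combined by simple averaging. The abstract does not specify the aggregation rule, so averaging and the function name `ensemble_predictions` are illustrative choices, not the authors' method.

```python
from collections import defaultdict


def ensemble_predictions(distributions):
    """Average answer probabilities across prompt variations.

    `distributions` is a list of dicts mapping answer -> probability,
    one dict per augmented prompt (hypothetical input format).
    """
    totals = defaultdict(float)
    for dist in distributions:
        for answer, prob in dist.items():
            totals[answer] += prob
    n = len(distributions)
    return {answer: total / n for answer, total in totals.items()}


# Two prompt variations that disagree on the top answer.
per_prompt = [
    {"Paris": 0.9, "Lyon": 0.1},
    {"Paris": 0.4, "Lyon": 0.6},
]
ensembled = ensemble_predictions(per_prompt)
top_answer = max(ensembled, key=ensembled.get)
```

Averaging dampens the effect of any single prompt that pushes the model toward an outlier answer, which is one plausible way the ensemble's confidence could track accuracy more closely.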
Hepatic Branch Vagotomy Can Suppress Liver Regeneration in Partially Hepatectomized Rats
The role of the vagus nerve in liver regeneration after partial hepatectomy was studied by comparing the
effects of hepatic branch vagotomy with those of hepatic branch sympathectomy in rats. The liver weight as a percentage of body weight decreased significantly 7 days after vagotomy compared with the controls, and this was associated with a reduction in food intake. There was no difference in the liver weights between the control rats and the pair-fed vagotomized rats. Hepatic sympathectomy had no significant effect on the liver weight. The serum scores indicating hepatic function showed no difference between the control and the vagotomized rats except for alkaline phosphatase. The concentration of insulin was unchanged. The number of mitotic hepatocytes remained high at 7 days after vagotomy. These
observations led us to conclude that the vagus nerve stimulates liver regeneration, and its effect depends on vagal factors directly and specifically.
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011),
a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun
resolution problems originally designed to be unsolvable for statistical models
that rely on selectional preferences or word associations. However, recent
advances in neural language models have already reached around 90% accuracy on
variants of WSC. This raises the important question of whether these models have
truly acquired robust commonsense capabilities or whether they rely on spurious
biases in the datasets that lead to an overestimation of the true capabilities
of machine commonsense. To investigate this question, we introduce WinoGrande,
a large-scale dataset of 44k problems, inspired by the original WSC design, but
adjusted to improve both the scale and the hardness of the dataset. The key
steps of the dataset construction consist of (1) a carefully designed
crowdsourcing procedure, followed by (2) systematic bias reduction using a
novel AfLite algorithm that generalizes human-detectable word associations to
machine-detectable embedding associations. The best state-of-the-art methods on
WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of
94.0%, depending on the amount of the training data allowed. Furthermore, we
establish new state-of-the-art results on five related benchmarks - WSC
(90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%).
These results have dual implications: on one hand, they demonstrate the
effectiveness of WinoGrande when used as a resource for transfer learning. On
the other hand, they raise a concern that we are likely to be overestimating
the true capabilities of machine commonsense across all these benchmarks. We
emphasize the importance of algorithmic bias reduction in existing and future
benchmarks to mitigate such overestimation.
Causal schema induction for knowledge discovery
Making sense of familiar yet new situations typically involves making
generalizations about causal schemas, stories that help humans reason about
event sequences. Reasoning about events includes identifying cause and effect
relations shared across event instances, a process we refer to as causal schema
induction. Statistical schema induction systems may leverage structural
knowledge encoded in discourse or the causal graphs associated with event
meaning; however, resources to study such causal structure are few in number and
limited in size. In this work, we investigate how to apply schema induction
models to the task of knowledge discovery for enhanced search of
English-language news texts. To tackle the problem of data scarcity, we present
Torquestra, a manually curated dataset of text-graph-schema units integrating
temporal, event, and causal structures. We benchmark our dataset on three
knowledge discovery tasks, building and evaluating models for each. Results
show that systems that harness causal structure are effective at identifying
texts sharing similar causal meaning components rather than relying on lexical
cues alone. We make our dataset and models available for research purposes.
Comment: 8 pages, appendix
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
As large language models (LLMs) gain popularity among speakers of diverse
languages, we believe that it is crucial to benchmark them to better understand
model behaviors, failures, and limitations in languages beyond English. In this
work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national
medical licensing examinations from the past five years, including the current
year. Our team comprises native Japanese-speaking NLP researchers and a
practicing cardiologist based in Japan. Our experiments show that GPT-4
outperforms ChatGPT and GPT-3 and passes all six years of the exams,
highlighting LLMs' potential in a language that is typologically distant from
English. However, our evaluation also exposes critical limitations of the
current LLM APIs. First, LLMs sometimes select prohibited choices that should
be strictly avoided in medical practice in Japan, such as suggesting
euthanasia. Further, our analysis shows that the API costs are generally higher
and the maximum context size is smaller for Japanese because of the way
non-Latin scripts are currently tokenized in the pipeline. We release our
benchmark as Igaku QA as well as all model outputs and exam metadata. We hope
that our results and benchmark will spur progress on more diverse applications
of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
Comment: Added results from the March 2023 exam